Incident Response
If you are not a member of the Mobile Platform team and need to report a mobile incident:
post a message in the #va-mobile-app channel, tagging @mobile-incident-response.
For members of the Core Mobile Platform team:
This document outlines the process for responding to critical VAHB incidents reported during business hours (Monday - Friday, 9am ET - 5pm PT).
For more explicit guidance around recurring issues, refer to Mobile Plays and Postmortems.
Step one - Acknowledge the incident
Regardless of the method by which the incident has been detected, take the following steps to acknowledge the incident and bring the incident response crew together.
- Acknowledge the issue in #va-mobile-app if it has not already been posted and tag @mobile-incident-response
- Identify the Incident Commander (Default to Michelle Middaugh, Backup: Ryan Thurlwell). This person will be the primary point of contact and will be responsible for keeping Leadership updated with the current status of an active incident at least every hour with the following:
- The current state of the service
- Remediation steps taken
- Any new findings since the last update
- Theory eliminations (i.e. ‘What have we determined is not the cause?’)
- Anticipated next steps
- ETA for the next update (if possible)
- Identify the lead engineer (Default to Jon Bindbeutel, Backup: John Marchi)
- The IC creates a temporary Slack channel using conventional naming such as 010826-mobile-incident, tagging @mobile-incident-response. This channel will be dedicated to incident conversations and can be closed following the post mortem and retro.
- Return to the original incident post in #va-mobile-app and thread the name of the Incident Commander and a link to the incident channel
- Open an incident room/bridge call and post the link in the temp channel
Step two - Determine impact
If it is a widespread outage affecting both web and mobile:
- Acknowledge the issue in #va-mobile-app and tag @mobile-incident-response
- Check the #oncall and #vfs-platform-support channels. If OCTO or Platform staff are aware of and addressing the incident, stand by and monitor for escalation of any mobile-specific questions. Report the incident if no post exists.
If the incident is an external outage:
- Acknowledge the issue in #va-mobile-app and tag @mobile-incident-response and the relevant Experience team(s)
- Check the #oncall and #vfs-platform-support channels. If OCTO or Platform staff are aware of and addressing the incident, stand by and monitor for escalation of any mobile-specific questions. Report the incident if no post exists.
If the incident affects mobile only, first assess whether it is a security incident. Security incidents are prioritized as Critical and must be escalated as described below.
- Has the system been compromised by a third party?
- Has there been a leak of personally identifiable information?
- Is the system under attack?
- If yes to any of the above, the incident is considered a critical security incident. Immediately inform VA Product Owners, who will escalate the incident to Tier 3 following the steps in the Security Incidents documentation. Critical protocol applies.
If the incident affects mobile only and is not a security incident: determine the impact using the matrix below, with consideration for the user experience. An incident may be more severe if the app does not respond gracefully or present appropriate error messaging.
| Incident level | Qualifications for level | Escalation steps |
|---|---|---|
| Critical incident | | |
| Priority/Major incident | | |
| Moderate/Medium incident | | As a rule, no escalation is needed for medium incidents, but the Incident Commander can choose to escalate at their discretion. |
| Minor/Low incident | | As a rule, no escalation is needed for minor incidents, but the Incident Commander can choose to escalate at their discretion. |
Additional Incident Matrix documentation: VA Major Incident Management Team Matrix
Step three - Communicate with Veterans and Stakeholders
In addition to the Incident Commander's communication requirements, consider others you may need to keep informed, including:
Veterans: Use appropriate tooling and communication channels to ensure Veterans are aware of the issue as necessary and do not spend time doing work that will be lost. This may include:
- Adding an availability framework message (FE or BE)
- Disabling a given feature via remote config (see the sketch below)
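For example, if the affected feature sits behind a remote config flag, it can be switched off without an app release. The following is a minimal sketch in TypeScript, assuming flags are read through react-native-firebase's Remote Config module; the flag key `appointments_enabled` and the helper name are hypothetical, not the app's actual configuration.

```typescript
import remoteConfig from '@react-native-firebase/remote-config';

// Hypothetical flag key; real keys live in the app's remote config project.
const APPOINTMENTS_FLAG = 'appointments_enabled';

// Returns whether the feature should be shown, preferring the value
// published in the remote config console over the in-app default.
export async function isAppointmentsEnabled(): Promise<boolean> {
  // In-app default so the check still resolves if the fetch fails.
  await remoteConfig().setDefaults({ [APPOINTMENTS_FLAG]: true });

  // Pull and activate the latest published values. During an incident,
  // flipping the flag in the console disables the feature on the next fetch.
  await remoteConfig().fetchAndActivate();

  return remoteConfig().getValue(APPOINTMENTS_FLAG).asBoolean();
}
```

The app-side check only controls how quickly clients pick up the change; the actual kill switch is flipped in the remote config console by whoever holds access during the incident.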
Stakeholders: Ensure that your VA points of contact are informed and aware of the issue, its impact, and the expected resolution timeline. Include a link to relevant issues and/or Slack conversations so that they can keep up to date on progress.
Step four - Diagnosis and determine path to resolution
- Determine the root cause
- Go wide, then go deep. First verify the overall state of the system. Is it a particular API endpoint that is unavailable, or the whole API? Isolate what parts of the system are actually affected.
- Rule out external factors. Given the nature of vets-api as a facade over other VA systems, check whether any relevant upstream dependencies are unavailable or experiencing elevated error rates. In that case, the likely response is to notify the team responsible for the upstream system and, if possible, set a maintenance window in PagerDuty to trigger the downtime notification mechanism (see the sketch after this list).
- Don't forget about non-API dependencies such as the VA network gateways that sit in front of our backend services. If these are experiencing an issue it's almost certainly also affecting VA.gov and possibly the VA as a whole.
- Look for what changed. If a behavior began suddenly, determine if anything changed recently that can account for it - did an API deployment occur? Did the Web Platform operations team change some infrastructure?
- Look inward. Is there something we did that caused the issue? Sometimes certs expire or upstream terms of service need to be accepted without our knowledge.
- Refer to Triaging an Incident steps as necessary
- The Incident Commander should capture notes, discussions, and other items (screenshots, log messages, etc.) that can act as part of the record of the incident for later reporting to stakeholders, and to assist the team when a retrospective is conducted.
- Consider the amount of time until the next natural release
- Recommend a mitigation and execute it where applicable
- The Incident Commander continues to communicate the current state and progress in the incident thread with relevant details
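When the diagnosis points to an upstream outage (see the external-factors step above), a PagerDuty maintenance window can be set on the affected service to trigger the downtime notification mechanism and suppress redundant alerts. Below is a minimal sketch against the public PagerDuty REST API; the API token, service ID, requester email, and one-hour duration are placeholder assumptions, and the team's actual tooling or permissions may differ.

```typescript
// Minimal sketch: open a one-hour PagerDuty maintenance window for one service.
// PAGERDUTY_TOKEN, SERVICE_ID, and the From email are placeholders.
const PAGERDUTY_TOKEN = process.env.PAGERDUTY_TOKEN ?? '';
const SERVICE_ID = 'PXXXXXX'; // hypothetical PagerDuty service ID

export async function createMaintenanceWindow(description: string): Promise<void> {
  const start = new Date();
  const end = new Date(start.getTime() + 60 * 60 * 1000); // one hour from now

  const response = await fetch('https://api.pagerduty.com/maintenance_windows', {
    method: 'POST',
    headers: {
      Authorization: `Token token=${PAGERDUTY_TOKEN}`,
      Accept: 'application/vnd.pagerduty+json;version=2',
      'Content-Type': 'application/json',
      From: 'incident-commander@example.com', // requester email required by the API
    },
    body: JSON.stringify({
      maintenance_window: {
        type: 'maintenance_window',
        start_time: start.toISOString(),
        end_time: end.toISOString(),
        description,
        services: [{ id: SERVICE_ID, type: 'service_reference' }],
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`PagerDuty maintenance window request failed: ${response.status}`);
  }
}
```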
Step Five - After incident
A post-mortem report and retro are required for incidents that:
- are determined to be Critical or High priority, including those which involve the security of a system or a Veteran's data
- require the use of a playbook
- impact a significant portion of the userbase
- persist for a significant period of time
- require out-of-app coordination or an out-of-band app release
- prompt a request from OCTO leadership
Post Mortems: The goal of the postmortem is not to assign blame, but to improve our ability to prevent, detect, and respond to future incidents. Possible follow-up actions include adding additional monitoring, adding implementation safeguards, tuning alerts, adding documentation, and refining inter-team communication processes.
- Follow the instructions to create a postmortem document and get a draft up within 24 hours.
- Follow the instructions for the Incident Retrospective process
- Ensure that your team’s VA points of contact, as well as the VA Platform’s points of contact (Steve Albers and Erika Washburn), are aware of the incident resolution and given a chance to review the post-mortem
- If the incident is such that it may occur again in the future – or if it follows a theme common to incidents in the past – add the incident and the steps to resolve it as a Mobile Incident Response Play.